#%pip install numpy
#%pip install pandas
#%pip install matplotlib
#%pip install ipympl
#%pip install seaborn
#%pip install scikit-learn
%pip install xgboost
%pip install catboost
%pip install lightgbm
import numpy as np # linear algebra
import pandas as pd # data processing
#Data visualization libraries
import matplotlib.pyplot as plt # data visualization with matplotlib
import seaborn as sns # data visualization with seaborn
# Interactive plots
%matplotlib inline
import plotly.express as px
import plotly.graph_objects as go
#Data Profiling
from ydata_profiling import ProfileReport
#Data Preprocessing
from sklearn.preprocessing import StandardScaler
# Machine Learning
from sklearn.model_selection import train_test_split # data split
from sklearn import linear_model
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression # Linear Regression
from sklearn.tree import DecisionTreeRegressor # Decision Tree Regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor # Random Forest Regression , Gradient Boosting Regression, AdaBoost Regression
from xgboost import XGBRegressor # XGBoost Regression
from catboost import CatBoostRegressor # CatBoost Regression
from lightgbm import LGBMRegressor # LightGBM Regression
from sklearn.metrics import mean_squared_error, r2_score # model evaluation
from scipy import stats
df=pd.read_csv('diamonds.csv')
Data Overview
df.head()
| | Unnamed: 0 | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 2 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 3 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 4 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 5 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  53940 non-null  int64
 1   carat       53940 non-null  float64
 2   cut         53940 non-null  object
 3   color       53940 non-null  object
 4   clarity     53940 non-null  object
 5   depth       53940 non-null  float64
 6   table       53940 non-null  float64
 7   price       53940 non-null  int64
 8   x           53940 non-null  float64
 9   y           53940 non-null  float64
 10  z           53940 non-null  float64
dtypes: float64(6), int64(2), object(3)
memory usage: 4.5+ MB
df.describe()
| | Unnamed: 0 | carat | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|
| count | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 |
| mean | 26970.500000 | 0.797940 | 61.749405 | 57.457184 | 3932.799722 | 5.731157 | 5.734526 | 3.538734 |
| std | 15571.281097 | 0.474011 | 1.432621 | 2.234491 | 3989.439738 | 1.121761 | 1.142135 | 0.705699 |
| min | 1.000000 | 0.200000 | 43.000000 | 43.000000 | 326.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 13485.750000 | 0.400000 | 61.000000 | 56.000000 | 950.000000 | 4.710000 | 4.720000 | 2.910000 |
| 50% | 26970.500000 | 0.700000 | 61.800000 | 57.000000 | 2401.000000 | 5.700000 | 5.710000 | 3.530000 |
| 75% | 40455.250000 | 1.040000 | 62.500000 | 59.000000 | 5324.250000 | 6.540000 | 6.540000 | 4.040000 |
| max | 53940.000000 | 5.010000 | 79.000000 | 95.000000 | 18823.000000 | 10.740000 | 58.900000 | 31.800000 |
Overview of the data embedded in this notebook
#dataset_profile=ProfileReport(df, title="Diamond Data Profile")
#dataset_profile.to_notebook_iframe()
Data Profiling Export to HTML
#dataset_profile.to_file("Diamond Detailed Data Profile.html")
df.drop('Unnamed: 0',axis='columns',inplace=True)
df
| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53935 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.50 |
| 53936 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
| 53937 | 0.70 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
| 53938 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |
| 53939 | 0.75 | Ideal | D | SI2 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 |
53940 rows × 10 columns
df.isna().sum()
carat      0
cut        0
color      0
clarity    0
depth      0
table      0
price      0
x          0
y          0
z          0
dtype: int64
There are no null values in the dataset.
Price Distribution
plt.figure(figsize=(20,8))
sns.histplot(x=df['price'],bins=50,kde=True)
plt.tight_layout()
plt.show()
Relation between price and the numeric features: carat, depth, table, x, y, z
fig, ax = plt.subplots(2, 3, figsize=(30, 16))
i = 0; j = 0
for col in df.select_dtypes(include='float64'):
    sns.scatterplot(x=col, y='price', data=df, color='green', ax=ax[i, j])
    j += 1
    if j == 3:
        j = 0
        i += 1
plt.tight_layout()
plt.show()
# calculate the correlation matrix on the numeric columns only
# (passing the full frame to corr() is deprecated when object columns are present)
corr = df.select_dtypes('number').corr()
# plot the heatmap
heatmap = sns.heatmap(corr, vmin=-1, vmax=1, annot=True)
# Give a title to the heatmap. pad sets the distance of the title from the top of the heatmap.
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize': 12}, pad=12);
Relation between price and cut, color, clarity (categorical)
fig, ax = plt.subplots(1, 3, figsize=(30, 8))
palette = ['deep', 'coolwarm', None]
for i, col in enumerate(df.select_dtypes(include='object').columns):
    sns.barplot(x=col, y='price', data=df, ax=ax[i], palette=palette[i])
plt.tight_layout()
plt.show()
Count of Cut, Color, Clarity (Categorical)
fig, ax = plt.subplots(1, 3, figsize=(30, 8))
palette = ['deep', 'coolwarm', None]
for i, col in enumerate(df.select_dtypes(include='object').columns):
    sns.countplot(x=col, data=df, ax=ax[i], palette=palette[i])
plt.tight_layout()
plt.show()
A diamond cannot have zero length, width, or depth, so rows where any of x, y, or z is zero are invalid and are dropped. Rows with width (y) > 30 or depth (z) > 30 also appear to be outliers, so they are removed as well.
#Make a copy of the original dataset
data_new = df.copy()
data_new
| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53935 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.50 |
| 53936 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
| 53937 | 0.70 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
| 53938 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |
| 53939 | 0.75 | Ideal | D | SI2 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 |
53940 rows × 10 columns
Drop the rows where x, y, or z is 0, or where y > 30 or z > 30
mask = ((data_new['x'] == 0) | (data_new['y'] == 0) | (data_new['z'] == 0)
        | (data_new['y'] > 30) | (data_new['z'] > 30))
data_new.drop(data_new[mask].index, inplace=True)
data_new
| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53935 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.50 |
| 53936 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
| 53937 | 0.70 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
| 53938 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |
| 53939 | 0.75 | Ideal | D | SI2 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 |
53917 rows × 10 columns
Now we visualize the data after removing the outliers of x, y, z
fig, ax = plt.subplots(2, 3, figsize=(30, 16))
i = 0; j = 0
# plot the cleaned data (data_new), not the original df
for col in data_new.select_dtypes(include='float64'):
    sns.scatterplot(x=col, y='price', data=data_new, color='green', ax=ax[i, j])
    j += 1
    if j == 3:
        j = 0
        i += 1
plt.tight_layout()
plt.show()
Now we can see that as carat increases, the price also increases: carat and price are strongly positively related.
The entries with table > 80 appear to be outliers, so we remove them as well.
data_new.drop(data_new.loc[data_new['table']>80].index,inplace=True)
data_new
| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53935 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.50 |
| 53936 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
| 53937 | 0.70 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
| 53938 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |
| 53939 | 0.75 | Ideal | D | SI2 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 |
53916 rows × 10 columns
data_new[data_new.duplicated()]
data_new.drop_duplicates(inplace=True)
data_new
| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53935 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.50 |
| 53936 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
| 53937 | 0.70 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
| 53938 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |
| 53939 | 0.75 | Ideal | D | SI2 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 |
53771 rows × 10 columns
As the row counts show (53916 before, 53771 after), there were 145 duplicate rows in the dataset, which have been removed.
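The duplicate check and removal above can be sketched on a tiny toy frame (illustrative only, not the diamonds data):

```python
import pandas as pd

# toy frame with one exact duplicate row
toy = pd.DataFrame({"carat": [0.23, 0.23, 0.31], "price": [326, 326, 335]})

dupes = toy[toy.duplicated()]    # rows that exactly repeat an earlier row
deduped = toy.drop_duplicates()  # keep the first occurrence of each row

print(len(dupes), len(deduped))  # → 1 2
```

`duplicated()` marks only the later copies as duplicates, so `drop_duplicates()` always keeps the first occurrence.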
Encode the categorical columns as numbers so the machine learning models can use them; in this dataset that means cut, color, and clarity. pd.get_dummies() encodes each categorical column by creating one indicator (dummy) column per category; drop_first=True drops one category per feature to avoid perfect collinearity among the dummies. Later, the y and z columns are dropped as well: the OLS summary's large condition number warns of strong multicollinearity, and y and z are nearly redundant with x.
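As a small illustration of what `drop_first=True` does, on a toy frame (not the diamonds data):

```python
import pandas as pd

toy = pd.DataFrame({"cut": ["Ideal", "Premium", "Good"]})
# drop_first=True drops the alphabetically first category ("Good" here),
# which becomes the implicit baseline and avoids the dummy-variable trap
encoded = pd.get_dummies(toy, columns=["cut"], drop_first=True)
print(list(encoded.columns))  # → ['cut_Ideal', 'cut_Premium']
```

A row with zeros in every dummy column then represents the dropped baseline category.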
#Keep Original Data for further actions
data_Categorical=data_new.copy()
data_new_ready=pd.get_dummies(data_new,columns=['cut','color','clarity'],drop_first=True)
target='price'
X0=data_new_ready.drop([target],axis=1)
y0=data_new_ready[[target]]
X0_train,X0_test,y0_train,y0_test=train_test_split(X0,y0,test_size=0.3,random_state=0,)
Xi = sm.add_constant(X0_train)
esti = sm.OLS(y0_train, Xi)
esti2 = esti.fit()
print(esti2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: price R-squared: 0.921
Model: OLS Adj. R-squared: 0.921
Method: Least Squares F-statistic: 1.900e+04
Date: Wed, 06 Dec 2023 Prob (F-statistic): 0.00
Time: 17:42:06 Log-Likelihood: -3.1768e+05
No. Observations: 37639 AIC: 6.354e+05
Df Residuals: 37615 BIC: 6.356e+05
Df Model: 23
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
const -4412.8443 755.788 -5.839 0.000 -5894.209 -2931.480
carat 1.157e+04 61.803 187.189 0.000 1.14e+04 1.17e+04
depth 48.8664 10.960 4.458 0.000 27.384 70.349
table -23.4001 3.474 -6.736 0.000 -30.209 -16.591
x -1552.9721 122.308 -12.697 0.000 -1792.698 -1313.246
y 1673.1020 125.434 13.338 0.000 1427.248 1918.956
z -2068.0472 168.343 -12.285 0.000 -2398.004 -1738.090
cut_Good 502.4655 40.593 12.378 0.000 422.902 582.029
cut_Ideal 773.3659 40.234 19.222 0.000 694.506 852.226
cut_Premium 737.9933 38.386 19.226 0.000 662.757 813.230
cut_Very Good 643.2127 39.308 16.363 0.000 566.168 720.257
color_E -212.7592 21.218 -10.027 0.000 -254.346 -171.172
color_F -267.4972 21.428 -12.484 0.000 -309.496 -225.498
color_G -480.0944 21.015 -22.846 0.000 -521.284 -438.905
color_H -983.5658 22.439 -43.834 0.000 -1027.546 -939.586
color_I -1479.1774 25.202 -58.692 0.000 -1528.575 -1429.780
color_J -2384.6537 31.072 -76.747 0.000 -2445.555 -2323.752
clarity_IF 5342.6712 60.890 87.743 0.000 5223.325 5462.018
clarity_SI1 3669.8884 51.820 70.820 0.000 3568.320 3771.457
clarity_SI2 2718.7299 51.998 52.285 0.000 2616.813 2820.647
clarity_VS1 4583.3116 52.903 86.637 0.000 4479.621 4687.002
clarity_VS2 4260.1940 52.090 81.785 0.000 4158.096 4362.292
clarity_VVS1 5002.4335 56.086 89.193 0.000 4892.504 5112.363
clarity_VVS2 4942.7829 54.487 90.715 0.000 4835.988 5049.578
==============================================================================
Omnibus: 10020.883 Durbin-Watson: 1.995
Prob(Omnibus): 0.000 Jarque-Bera (JB): 511304.234
Skew: 0.472 Prob(JB): 0.00
Kurtosis: 21.032 Cond. No. 1.13e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.13e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
# modify figure size
fig = plt.figure(figsize=(14, 8))
# creating regression plots
fig = sm.graphics.plot_regress_exog(esti2,
'carat',
fig=fig)
data_new_ready = data_new_ready.drop(['y', 'z'], axis = 1)
data_new_ready
| | carat | depth | table | price | x | cut_Good | cut_Ideal | cut_Premium | cut_Very Good | color_E | ... | color_H | color_I | color_J | clarity_IF | clarity_SI1 | clarity_SI2 | clarity_VS1 | clarity_VS2 | clarity_VVS1 | clarity_VVS2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 61.5 | 55.0 | 326 | 3.95 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0.21 | 59.8 | 61.0 | 326 | 3.89 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0.23 | 56.9 | 65.0 | 327 | 4.05 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0.29 | 62.4 | 58.0 | 334 | 4.20 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0.31 | 63.3 | 58.0 | 335 | 4.34 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53935 | 0.72 | 60.8 | 57.0 | 2757 | 5.75 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 53936 | 0.72 | 63.1 | 55.0 | 2757 | 5.69 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 53937 | 0.70 | 62.8 | 60.0 | 2757 | 5.66 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 53938 | 0.86 | 61.0 | 58.0 | 2757 | 6.15 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 53939 | 0.75 | 62.2 | 55.0 | 2757 | 5.83 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
53771 rows × 22 columns
Set price as the target variable and the remaining columns as features.
target='price'
X=data_new_ready.drop([target],axis=1)
y=data_new_ready[[target]]
Checking the contents of X and y
X.head(1)
| | carat | depth | table | x | cut_Good | cut_Ideal | cut_Premium | cut_Very Good | color_E | color_F | ... | color_H | color_I | color_J | clarity_IF | clarity_SI1 | clarity_SI2 | clarity_VS1 | clarity_VS2 | clarity_VVS1 | clarity_VVS2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 61.5 | 55.0 | 3.95 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
1 rows × 21 columns
y.head(1)
| | price |
|---|---|
| 0 | 326 |
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=0,)
Feature scaling puts all features on a comparable scale, which helps distance- and gradient-based models; it is left commented out here because the tree-based models used below do not require it.
#SC_X = StandardScaler()
#X_train = SC_X.fit_transform(X_train)
#X_test = SC_X.transform(X_test)
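The commented-out pattern above matters when scaling is enabled: fit the scaler on the training split only, then reuse the same statistics on the test split, so no information leaks from test to train. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_tr = np.array([[1.0], [2.0], [3.0]])  # toy "training" feature
X_te = np.array([[2.0]])                # toy "test" feature

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learns mean/std from training data
X_te_scaled = scaler.transform(X_te)      # reuses the training mean/std

print(X_tr_scaled.mean(), X_te_scaled)  # training mean → 0; test value 2.0 maps to 0 too
```

Calling `fit_transform` on the test set instead would silently recompute the statistics and invalidate the comparison between splits.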
First, define a helper function for model evaluation
training_score = []
testing_score = []
rmse = []

def model_prediction(model):
    model.fit(X_train, y_train.values.ravel())  # ravel avoids the column-vector warning
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    a = r2_score(y_train, y_train_pred) * 100
    b = r2_score(y_test, y_test_pred) * 100
    c = np.sqrt(mean_squared_error(y_test, y_test_pred))  # RMSE = sqrt(MSE)
    training_score.append(a)
    testing_score.append(b)
    rmse.append(c)
    name = type(model).__name__  # class name keeps the printout readable
    print(f"r2_Score of {name} model on Training Data is:", a)
    print(f"r2_Score of {name} model on Testing Data is:", b)
    print(f"RMSE of {name} model on Testing Data is:", c)
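For reference, `sklearn.metrics.mean_squared_error` returns the *squared* error; taking the square root converts it to RMSE, expressed in the target's units (dollars here). A toy check:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 310.0])  # each prediction off by 10

mse = mean_squared_error(y_true, y_pred)  # mean of squared errors
rmse = np.sqrt(mse)                       # back in the target's units

print(mse, rmse)  # → 100.0 10.0
```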
model_prediction(LinearRegression())
r2_Score of LinearRegression() model on Training Data is: 92.02780118315786
r2_Score of LinearRegression() model on Testing Data is: 91.99397608062448
RMSE of LinearRegression() model on Testing Data is: 1281621.0212671217
model_prediction(DecisionTreeRegressor())
r2_Score of DecisionTreeRegressor() model on Training Data is: 99.99908658591109
r2_Score of DecisionTreeRegressor() model on Testing Data is: 95.21759453879659
RMSE of DecisionTreeRegressor() model on Testing Data is: 765577.4493088272
X2 = sm.add_constant(X_train)
est = sm.OLS(y_train, X2)
est2 = est.fit()
print(est2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: price R-squared: 0.920
Model: OLS Adj. R-squared: 0.920
Method: Least Squares F-statistic: 2.068e+04
Date: Wed, 06 Dec 2023 Prob (F-statistic): 0.00
Time: 17:42:44 Log-Likelihood: -3.1779e+05
No. Observations: 37639 AIC: 6.356e+05
Df Residuals: 37617 BIC: 6.358e+05
Df Model: 21
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
const 3273.8578 467.079 7.009 0.000 2358.370 4189.345
carat 1.154e+04 61.707 187.042 0.000 1.14e+04 1.17e+04
depth -76.4341 4.863 -15.716 0.000 -85.967 -66.902
table -24.9481 3.478 -7.173 0.000 -31.765 -18.131
x -1148.7170 26.132 -43.958 0.000 -1199.937 -1097.497
cut_Good 598.4588 39.983 14.968 0.000 520.091 676.827
cut_Ideal 851.7280 39.872 21.362 0.000 773.578 929.878
cut_Premium 777.4750 38.411 20.241 0.000 702.189 852.761
cut_Very Good 746.3731 38.423 19.425 0.000 671.064 821.682
color_E -211.9465 21.281 -9.959 0.000 -253.658 -170.235
color_F -267.4180 21.492 -12.443 0.000 -309.543 -225.293
color_G -480.6893 21.077 -22.806 0.000 -522.002 -439.377
color_H -986.7398 22.503 -43.849 0.000 -1030.847 -942.633
color_I -1474.0207 25.275 -58.319 0.000 -1523.561 -1424.481
color_J -2381.1576 31.164 -76.407 0.000 -2442.240 -2320.075
clarity_IF 5395.2985 60.904 88.587 0.000 5275.925 5514.672
clarity_SI1 3714.5770 51.850 71.641 0.000 3612.950 3816.204
clarity_SI2 2756.4792 52.065 52.943 0.000 2654.430 2858.528
clarity_VS1 4629.4998 52.918 87.485 0.000 4525.780 4733.220
clarity_VS2 4302.1754 52.128 82.531 0.000 4200.003 4404.348
clarity_VVS1 5049.8087 56.109 89.999 0.000 4939.833 5159.784
clarity_VVS2 4991.0682 54.500 91.580 0.000 4884.248 5097.889
==============================================================================
Omnibus: 10002.758 Durbin-Watson: 1.996
Prob(Omnibus): 0.000 Jarque-Bera (JB): 528015.336
Skew: 0.455 Prob(JB): 0.00
Kurtosis: 21.326 Cond. No. 6.85e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.85e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
# modify figure size
fig = plt.figure(figsize=(14, 8))
# creating regression plots
fig = sm.graphics.plot_regress_exog(est2,
'carat',
fig=fig)
# est2 is the fitted OLS model from above
residuals = est2.resid
# Create a residual normal distribution plot
plt.figure(figsize=(8, 6))
plt.hist(residuals, bins=30, density=True, color='blue', alpha=0.7)
mu, sigma = stats.norm.fit(residuals)
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, mu, sigma)
plt.plot(x, p, 'k', linewidth=2)
title = "Fit results: mu = %.2f, std = %.2f" % (mu, sigma)
plt.title(title)
plt.xlabel('Residuals')
plt.ylabel('Density')  # density=True in plt.hist normalizes the histogram
plt.show()
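A complementary normality check, not in the original notebook, is a Q-Q plot, which compares the residual quantiles against normal quantiles. Sketched here with synthetic residuals standing in for `est2.resid`:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
residuals = rng.normal(size=500)  # stand-in; with the fitted model, use est2.resid

fig, ax = plt.subplots(figsize=(6, 6))
# probplot returns the ordered data, plus the fitted line's slope/intercept and
# the correlation r between sample and theoretical quantiles
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm", plot=ax)
ax.set_title("Q-Q plot of residuals")

print(round(r, 3))  # r close to 1 means the residuals track the normal quantiles
```

For the diamond model, the heavy tails reported by the OLS kurtosis statistic would show up here as points bending away from the reference line at both ends.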
model_prediction(RandomForestRegressor())
r2_Score of RandomForestRegressor() model on Training Data is: 99.63949723600446
r2_Score of RandomForestRegressor() model on Testing Data is: 97.53962108446606
RMSE of RandomForestRegressor() model on Testing Data is: 393862.59274087363
model_prediction(XGBRegressor())
r2_Score of XGBRegressor(...) model on Training Data is: 98.71031434640751
r2_Score of XGBRegressor(...) model on Testing Data is: 97.8180153412913
RMSE of XGBRegressor(...) model on Testing Data is: 349296.65897145646
model_prediction(GradientBoostingRegressor())
r2_Score of GradientBoostingRegressor() model on Training Data is: 95.42919056247987
r2_Score of GradientBoostingRegressor() model on Testing Data is: 95.35198883069714
RMSE of GradientBoostingRegressor() model on Testing Data is: 744063.3305186994
model_prediction(AdaBoostRegressor())
r2_Score of AdaBoostRegressor() model on Training Data is: 84.70057605755476
r2_Score of AdaBoostRegressor() model on Testing Data is: 84.67999292256461
RMSE of AdaBoostRegressor() model on Testing Data is: 2452458.712855532
model_prediction(LGBMRegressor())
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002506 seconds. You can set `force_row_wise=true` to remove the overhead. And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 771
[LightGBM] [Info] Number of data points in the train set: 37639, number of used features: 21
[LightGBM] [Info] Start training from score 3931.778103
r2_Score of LGBMRegressor() model on Training Data is: 98.38651059955792
r2_Score of LGBMRegressor() model on Testing Data is: 97.96913972681276
RMSE of LGBMRegressor() model on Testing Data is: 325104.3518710946
model_prediction(CatBoostRegressor(verbose=False))
r2_Score of CatBoostRegressor model on Training Data is: 98.52147606503685
r2_Score of CatBoostRegressor model on Testing Data is: 97.95277938307015
RMSE of CatBoostRegressor model on Testing Data is: 327723.34984897176
Create a dataframe comparing all the models
models = ["Linear Regression","Decision Tree Regression","Random Forest Regression","XGBoost" ,"Gradient Boosting Regression","AdaBoost Regression","LGBM Regression","CatBoost Regression"]
compare_models = pd.DataFrame({"Algorithms":models,
"Training Score":training_score,
"Testing Score":testing_score,"RMSE":rmse})
compare_models
| | Algorithms | Training Score | Testing Score | RMSE |
|---|---|---|---|---|
| 0 | Linear Regression | 92.027801 | 91.993976 | 1.281621e+06 |
| 1 | Decision Tree Regression | 99.999087 | 95.217595 | 7.655774e+05 |
| 2 | Random Forest Regression | 99.639497 | 97.539621 | 3.938626e+05 |
| 3 | XGBoost | 98.710314 | 97.818015 | 3.492967e+05 |
| 4 | Gradient Boosting Regression | 95.429191 | 95.351989 | 7.440633e+05 |
| 5 | AdaBoost Regression | 84.700576 | 84.679993 | 2.452459e+06 |
| 6 | LGBM Regression | 98.386511 | 97.969140 | 3.251044e+05 |
| 7 | CatBoost Regression | 98.521476 | 97.952779 | 3.277233e+05 |
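One caveat on this comparison: every score above comes from a single train/test split, which can be noisy. K-fold cross-validation gives a distribution of scores per model instead. A hedged sketch on synthetic data (the same pattern would apply to `X`, `y` and the models above):

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
# nearly noise-free linear target, so R² per fold should be close to 1
y_demo = X_demo @ np.array([3.0, -1.0, 2.0]) + rng.normal(scale=0.1, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")
print(scores.mean(), scores.std())  # mean R² across folds, plus its spread
```

A model whose fold scores vary widely is less trustworthy than one with a slightly lower but stable mean.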
Plot a bar chart comparing the models' training and testing R² scores
compare_models.plot(x="Algorithms",y=["Training Score","Testing Score"], figsize=(16,6),kind="bar",title="Performance Visualization of Different Models by R2Score",colormap="rainbow")
plt.show()
Plot a bar chart comparing the models' RMSE
compare_models.plot(x="Algorithms",y=["RMSE"], figsize=(16,6),kind="bar",title="Performance Visualization of Different Models by RMSE",colormap="Dark2")
plt.show()
The linear regression model fits the data well, with an R-squared of 0.92: roughly 92% of the variance in price is explained by the predictors. Every predictor, including carat, depth, table, x, and the cut, color, and clarity dummies, is statistically significant (p < 0.001), and the F-statistic is highly significant (Prob(F) ≈ 0), so the regression as a whole is meaningful. In the final model (after dropping y and z), the intercept is about 3274 and the carat coefficient is about 11540, indicating a strong positive relationship between carat and price, with narrow confidence intervals on all coefficients. The categorical variables cut, color, and clarity each shift the predicted price by a distinct, interpretable amount per category. The residuals are centered near zero but heavy-tailed (kurtosis ≈ 21, skew ≈ 0.46), so predictions for extreme prices should be treated with caution.